
Foundations for Big Data Analysis with SQL


These are my learning notes for the Coursera course [Foundations for Big Data Analysis with SQL](https://www.coursera.org/learn/foundations-big-data-analysis-sql/).


Week 1: Data and Databases

What is Data?

  • Digital data: information that can be transmitted, stored, and processed using modern digital technologies (as opposed to analog data)
  • Data is a representation of something that captures some features and ignores others.

Why Organize Data?

  • The organization of data has a major impact on how easily you can use the data to answer questions

What Does a DBMS Do?

  • A DBMS lets you define the kinds of data your database will hold (design), change the data in your database (update), get data out of it (retrieve), and manage who has access to different parts of your data (manage)

Relational Databases and SQL

  • Different language types: DDL, DML, DQL, DCL (Data Definition, Data Manipulation, Data Query, and Data Control Language)
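
As a minimal sketch (using a hypothetical toys table and analyst user), one statement from each language type:

```sql
CREATE TABLE toys (id INT, name VARCHAR(50));  -- DDL: define structure
INSERT INTO toys VALUES (1, 'kite');           -- DML: change stored data
SELECT name FROM toys WHERE id = 1;            -- DQL: retrieve data
GRANT SELECT ON toys TO analyst;               -- DCL: control access
```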


Week 2: Relational Databases and SQL

Data Types

  • BLOB (Binary Large Object) can store large binary data like a photograph, video, audio, or other media file
  • CLOB (Character Large Object) can store large character data like an HTML web page, or a complete book

Primary Keys and Foreign Keys

  • A primary key is immutable: once assigned, its value should never change
  • In a junction table, each column of the compound primary key is itself a foreign key: the columns refer to the primary keys of the related tables (see the sketch below)
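
A minimal sketch of such a junction table, assuming hypothetical students and courses tables:

```sql
-- Compound primary key whose columns are each foreign keys.
CREATE TABLE enrollments (
  student_id INT REFERENCES students (student_id),
  course_id  INT REFERENCES courses (course_id),
  PRIMARY KEY (student_id, course_id)
);
```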

Database Design

  • Database normalization is a strategy whereby you design each table so that it obeys certain organizational conditions or rules (e.g. 1NF, 2NF, …); the business community generally accepts 3NF
  • 3NF: a primary key; atomic columns; no repeating groups; non-key columns describe only the whole key; no derived columns
  • Denormalization is the strategy of consciously, deliberately “breaking” one or more of the rules of database normalization in your design
  • Denormalization examples (one is sketched below): allowing duplicate rows, pre-joined tables, derived columns, summary tables
  • Trade-offs between the two designs: data anomalies, enforcing data structure, storage size, SELECT speed
  • Reference: A Simple Guide to Five Normal Forms in Relational Database Theory
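
A hedged sketch of the derived-column example: the hypothetical orders_denorm table caches a total that a normalized design would compute from line items at query time.

```sql
-- Normalized: the total is derived at query time.
SELECT o.order_id, SUM(i.amount) AS order_total
FROM orders o
JOIN order_items i ON i.order_id = o.order_id
GROUP BY o.order_id;

-- Denormalized: a derived column stores the total redundantly,
-- trading possible update anomalies for faster SELECTs.
CREATE TABLE orders_denorm (
  order_id    INT PRIMARY KEY,
  order_date  DATE,
  order_total DECIMAL(10,2)  -- derived from SUM(order_items.amount)
);
```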

ACID

  • Atomicity, Consistency, Isolation, Durability
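
A minimal sketch of atomicity, using a hypothetical accounts table: the transfer commits as a whole or not at all.

```sql
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- issuing ROLLBACK here instead would undo both updates
```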

Week 3: Big Data

Volume

  • What is the minimum volume for big data? There is no strict cutoff; a figure like 30 TB is one rough reference point

  • The big data platform Apache Hadoop includes a file system called the Hadoop Distributed File System (HDFS); in HDFS, a single block is usually 128 megabytes
  • Splitting files into blocks across many machines enables distributed processing and parallel reads; for example, a 1 TB file divides into 8,192 blocks of 128 MB that can be read concurrently

Variety (and Velocity)

  • Digital data can be classified as structured, semi-structured, or unstructured (there are no strict boundaries)
  • Structured data: data that conforms to a set schema
  • Unstructured data: data without clear, definite structure (e.g. natural language text)
  • Semi-structured data: the fields within a record are tagged, but there is no definite schema that all records are guaranteed to meet (see the example below)
  • The velocity of data (the speed at which data is generated in modern systems) is another dimension that has grown notably in recent years with the rise of the internet, especially social media
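
For instance, these two hypothetical JSON records tag their fields with names, yet carry different sets of fields; that is semi-structured data:

```json
{"id": 1, "name": "Ana", "tags": ["sql", "hadoop"]}
{"id": 2, "name": "Ben", "email": "ben@example.com"}
```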

Big Data Systems

  • RDBMS strengths: enforcing business rules, transactions (OLTP), structure, many good choices, strong with small or medium data, simple security controls, fast (at reasonable scale), lots of tools and solutions
  • RDBMS limitations: schema on write, high cost of storage, weak support for unstructured data, difficulty of distributed transactions
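
To contrast with schema on write, here is a hedged sketch of schema on read in Hive-style SQL (hypothetical table and path): the schema is applied when the files are read, so data can land first and be described later.

```sql
CREATE EXTERNAL TABLE web_logs (
  ip     STRING,
  url    STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/web_logs/';  -- existing files, no load step required
```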

SQL and Data

  • Queries on structured data (all four combined in the sketch after this list):
    • combine records from multiple tables
    • group records by categorical values or ranges
    • compute counts or summary terms on any groups
    • sort on any resulting values
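
A minimal sketch combining all four operations over hypothetical orders and customers tables:

```sql
SELECT c.country,
       COUNT(*)     AS num_orders,  -- compute counts per group
       SUM(o.total) AS revenue      -- summary term per group
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id  -- combine tables
GROUP BY c.country                                 -- group by category
ORDER BY revenue DESC;                             -- sort on a result
```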

Week 4: SQL Tools for Big Data Analysis

NoSQL

  • They provide simple DML and query commands, and they physically organize records by a specific lookup key. Records can be stored in massive numbers, and a record can be found rapidly by its lookup key, but not easily by other values in the record (see the sketch below)
  • These systems are best for unstructured or semi-structured data with only a few, well-defined access patterns.
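
A hedged sketch of that access-pattern limitation, written in SQL-like form over a hypothetical events store organized by row_key:

```sql
-- Fast: records are physically organized by the lookup key.
SELECT * FROM events WHERE row_key = 'user42#2021-06-01';

-- Slow or unsupported: no secondary access path, so matching on
-- another value implies scanning every record.
SELECT * FROM events WHERE event_type = 'click';
```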

Non-transactional, Structured Systems

  • Between NoSQL systems and full RDBMSs
  • e.g. Apache Kudu, which enforces primary key constraints but not foreign key constraints
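
A minimal Impala-on-Kudu sketch (hypothetical customers table): the primary key is declared and enforced, but there is no FOREIGN KEY clause.

```sql
CREATE TABLE customers (
  id   BIGINT,
  name STRING,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;
```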

Big Database Types

  • analytic systems (data warehouses) (e.g. Impala, Hive)
  • operational systems
    • non-transactional, unstructured or semi-structured (e.g. HBase, MongoDB)
    • non-transactional, structured (e.g. Kudu)
    • ACID-compliant RDBMSs (e.g. Splice Machine, Trafodion, Phoenix)
  • search (e.g. Solr, Elasticsearch)

Features of SQL for Big Data Analysis

  • Challenges: distributed transactions, data variety
  • What We Keep: SELECT, seeing data as tables, DDL, DCL
  • What We Give Up: unique columns, primary & foreign key constraints, synchronized indexes, triggers and stored procedures, UPDATE and DELETE
  • What We Add: table partitions and bucketing (table partitioning divides a table by a predictable method, while bucketing divides it in an essentially non-predictable way; both are sketched below), support for many file formats (e.g. CSV, XML, JSON), complex data types
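
A hedged Hive-style sketch of both (hypothetical page_views table): the partition column divides the table predictably (one partition per date), while bucketing hashes rows into a fixed number of buckets.

```sql
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (view_date DATE)         -- predictable division
CLUSTERED BY (user_id) INTO 32 BUCKETS  -- hash-based division
STORED AS PARQUET;
```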

Storing Big Data

  • Deployment options: on premises, in the cloud, or hybrid (using both)